Document Content Inventory & Retrieval
نویسندگان
چکیده
We give an analysis of relationships between expected retrieval performance and classification recognition accuracy in the context of document image content extraction and inventory. By content extraction we mean location and measurement of regions containing handwriting, machineprinted text, photographs, blank space, etc, in documents represented as bilevel, grey-level, or color images. Recent experiments have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (of, e.g., 80–90%) for retrieval queries within document collections seeking pages that contain a given minimum fraction of a certain type of content. In an effort to elucidate this interesting empirical result, we have analyzed the interdependency of classification and retrieval under a variety of assumptions about the distribution of content types in document image collections. We show that under general conditions we cannot derive precision and recall measures from per-pixel classification measures alone, but we can estimate the expected values of these measures. If however the distribution of content and error rates are uniform across the entire collection, our results suggest, it is possible to predict precision and recall measures from classification accuracy and vice versa.
منابع مشابه
Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback
Keyword Spotting is a well-known method in document image retrieval. In this method, Search in document images is based on query word image. In this Paper, an approach for document image retrieval based on keyword spotting has been proposed. In proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method is used in this paper to imp...
متن کاملDocument image content inventories
We report an investigation into strategies, algorithms, and software tools for document image content extraction and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds of documents represented as bilevel, greylevel, or color images, tha...
متن کاملImproved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
متن کاملImproved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
متن کاملبررسی نقش انواع بافتار همنویسهها در تعیین شباهت بین مدارک
Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation and thereby improving the effectiveness of the results retrieved. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts could be divided to different types, wit...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2007